
Focused crawling: a new approach to topic-specific Web resource discovery


Abstract

The rapid growth of the World-Wide Web poses unprecedented scaling challenges for general-purpose crawlers and search engines. In this paper we describe a new hypertext resource discovery system called a Focused Crawler. The goal of a focused crawler is to selectively seek out pages that are relevant to a pre-defined set of topics. The topics are specified not using keywords, but using exemplary documents. Rather than collecting and indexing all accessible Web documents to be able to answer all possible ad-hoc queries, a focused crawler analyzes its crawl boundary to find the links that are likely to be most relevant for the crawl, and avoids irrelevant regions of the Web. This leads to significant savings in hardware and network resources, and helps keep the crawl more up-to-date. To achieve such goal-directed crawling, we designed two hypertext mining programs that guide our crawler: a classifier that evaluates the relevance of a hypertext document with respect to the focus topics, and a distiller that identifies hypertext nodes that are great access points to many relevant pages within a few links. We report on extensive focused-crawling experiments using several topics at different levels of specificity. Focused crawling acquires relevant pages steadily while standard crawling quickly loses its way, even though they are started from the same root set. Focused crawling is robust against large perturbations in the starting set of URLs. It discovers largely overlapping sets of resources in spite of these perturbations. It is also capable of exploring out and discovering valuable resources that are dozens of links away from the start set, while carefully pruning the millions of pages that may lie within this same radius. Our anecdotes suggest that focused crawling is very effective for building high-quality collections of Web documents on specific topics, using modest desktop hardware.
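To make the crawl-boundary idea concrete, the following is a minimal best-first crawler sketch, not the authors' implementation: the paper's trained classifier and distiller are stood in for by a hypothetical keyword-overlap relevance score, and the threshold, page limit, and timeout values are assumptions chosen only for illustration. Links are expanded in order of the relevance of the page they were found on, and out-links of pages below the relevance threshold are pruned, which is the mechanism that keeps the crawl inside the topical neighbourhood.

import heapq
import re
import urllib.parse
import urllib.request
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects href targets and visible text from a fetched page."""
    def __init__(self):
        super().__init__()
        self.links, self.text = [], []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        self.text.append(data)


def relevance(text, topic_terms):
    # Hypothetical stand-in for the paper's classifier:
    # fraction of topic terms that occur on the page.
    words = set(re.findall(r"[a-z]+", text.lower()))
    return sum(t in words for t in topic_terms) / len(topic_terms)


def focused_crawl(seeds, topic_terms, max_pages=100, threshold=0.3):
    # Frontier is a max-priority queue keyed by the parent page's relevance,
    # so links discovered on highly relevant pages are expanded first.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    seen, harvested = set(seeds), []

    while frontier and len(harvested) < max_pages:
        _, url = heapq.heappop(frontier)
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                html = resp.read().decode("utf-8", errors="ignore")
        except OSError:
            continue

        parser = LinkExtractor()
        parser.feed(html)
        score = relevance(" ".join(parser.text), topic_terms)
        if score < threshold:
            continue  # prune: do not keep or expand irrelevant pages

        harvested.append((url, score))
        for href in parser.links:
            link = urllib.parse.urljoin(url, href)
            if link.startswith("http") and link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score, link))

    return harvested

A call such as focused_crawl(["http://example.com/"], {"crawler", "hypertext", "topic"}) would return a list of (url, score) pairs; the real system replaces the keyword score with a trained topic classifier and adds the distiller to promote hub pages in the frontier.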
